Compilation of Specialized Comparable Corpora in French and Japanese

Authors

  • Lorraine Goeuriot
  • Emmanuel Morin
  • Béatrice Daille
Abstract

We present the development of a tool for compiling specialized comparable corpora whose quality is intended to approach that of a manually compiled corpus. Comparability is defined at three levels: domain, topic and type of discourse. Domain and topic can be filtered using the keywords of a web search, but detecting the type of discourse requires a broader linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in a scientific domain (science and popular science) in French and Japanese. First, a contrastive stylistic analysis of the two types of discourse is carried out in both languages. This analysis leads to a reusable, generic and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70%, which demonstrate the effectiveness of this typology. The classification tool is then integrated into a corpus compilation tool, a text collection processing chain implemented with the IBM UIMA system. Starting from two collections of specialized web documents in French and Japanese, this tool creates the corresponding corpus.
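As an illustration of how averaged precision and recall figures like those above can be computed for a two-class discourse classifier, here is a minimal sketch. It assumes macro-averaging over the two discourse classes; the abstract does not say how its averages were obtained, and the labels, function names and toy data below are hypothetical.

```python
def precision_recall(gold, predicted, label):
    """Return (precision, recall) for one discourse label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_average(gold, predicted, labels):
    """Average per-label precision and recall (assumed averaging scheme)."""
    scores = [precision_recall(gold, predicted, label) for label in labels]
    avg_precision = sum(p for p, _ in scores) / len(scores)
    avg_recall = sum(r for _, r in scores) / len(scores)
    return avg_precision, avg_recall


# Hypothetical labels and toy data, not taken from the paper.
LABELS = ("scientific", "popular")
gold = ["scientific", "scientific", "scientific", "popular", "popular"]
predicted = ["scientific", "scientific", "popular", "popular", "scientific"]

avg_precision, avg_recall = macro_average(gold, predicted, LABELS)
print(f"average precision: {avg_precision:.2f}, average recall: {avg_recall:.2f}")
```

Averaging per class rather than per document keeps the smaller class from being swamped by the larger one, which matters when the two discourse types are unevenly represented in a web collection.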


Similar resources

Looking for Transliterations in a Trilingual English, French and Japanese Specialised Comparable Corpus

Transliterations and cognates have been shown to be useful in the case of bilingual extraction from parallel corpora. Observation of transliterations in a trilingual English, French and Japanese specialised comparable corpus reveals evidence that they are likely to be used with comparable corpora too, since they are an important and relevant part of the common vocabulary, but they also yield l...


Neoclassical Compound Alignments from Comparable Corpora

The paper deals with the automatic compilation of a bilingual dictionary from specialized comparable corpora. We concentrate on a method to automatically extract and align neoclassical compounds in two languages from comparable corpora. In order to do this, we assume that neoclassical compounds translate compositionally to neoclassical compounds from one language to another. The method covers ...


Compiling French-Japanese Terminologies from the Web

We propose a method for compiling bilingual terminologies of multi-word terms (MWTs) for given translation pairs of seed terms. Traditional methods for bilingual terminology compilation exploit parallel texts, while the more recent ones have focused on comparable corpora. We use bilingual corpora collected from the web and tailor-made for the seed terms. For each language, we extract from the c...


Anchor points for bilingual lexicon extraction from small comparable corpora

We examine the contribution of reliable elements in French– and English–Japanese alignment from comparable corpora, using transliterated elements and scientific compounds as anchor points among context-vectors of elements to align. We highlight those elements in context-vector normalisation to give them a higher priority in context-vector comparison. We carry out experiments on small comparable...


Towards a Generic Approach for Bilingual Lexicon Extraction from Comparable Corpora

This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the problem associated with polysemous words found in the seed bilingual lexicon when translating source context vectors. To improve the adequacy of context vectors, the use of a WordNet-based Word Sense Disambiguation process is tested. Experimental results...




Publication date: 2009